This article describes the framework and presents the development of a Machine Learning model with ensemble learning for predicting hospital readmissions. In Part 1, I detailed the process of cleaning up and preparing the data set, the Diabetes 130-US hospitals for years 1999-2008 Data Set, downloaded from the UCI Machine Learning Repository. To continue, I employed H2O and configured a stacked ensemble for making predictions with algorithms, or learners, including:

The logical steps for constructing and conducting ensemble learning, with pertinent information, are highlighted as follows:

Data Set

Notice that the data set employed for developing the ensemble described in this article was slightly different from the finalized data set in Part 1. Nevertheless, the process for preparing the data set was essentially identical, other than that the feature set was based on the results of a different Boruta run.

## Imported file: dataimp.csv with 70245 obs. and 27 variables

The imported data set had the following characteristics:

## 'data.frame':    70245 obs. of  27 variables:
##  $ race                    : Factor w/ 5 levels "AfricanAmerican",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ gender                  : Factor w/ 2 levels "Femals","Male": 1 2 2 1 1 1 1 1 1 1 ...
##  $ age                     : num  1.5 1.5 1 2 1.5 1.5 1.5 1.5 1 2 ...
##  $ admission_type_id       : Factor w/ 2 levels "k","u": 1 1 2 2 2 2 2 1 2 2 ...
##  $ discharge_disposition_id: Factor w/ 4 levels "d","h","o","u": 1 1 4 1 4 4 4 1 1 4 ...
##  $ admission_source_id     : Factor w/ 5 levels "b","o","r","t",..: 3 2 3 3 2 2 3 4 3 2 ...
##  $ time_in_hospital        : num  2.92 2.92 2.92 2.92 2.92 ...
##  $ num_lab_procedures      : num  -0.80047 1.70905 -0.148 0.00257 1.35772 ...
##  $ num_procedures          : num  0.857 -0.271 -0.271 0.293 0.857 ...
##  $ num_medications         : num  1.359 0.31 0.194 -0.506 1.942 ...
##  $ number_outpatient       : num  -0.275 -0.275 -0.275 -0.275 -0.275 ...
##  $ number_emergency        : num  -0.194 -0.194 -0.194 -0.194 -0.194 ...
##  $ number_inpatient        : num  -0.394 -0.394 -0.394 -0.394 -0.394 ...
##  $ diag_1                  : Factor w/ 9 levels "Circulatory",..: 7 1 6 2 3 9 1 1 2 1 ...
##  $ diag_2                  : Factor w/ 9 levels "Circulatory",..: 7 2 5 2 9 1 5 1 2 1 ...
##  $ diag_3                  : Factor w/ 9 levels "Circulatory",..: 3 2 2 1 1 2 1 1 3 6 ...
##  $ number_diagnoses        : num  0.341 0.341 -1.679 -2.185 0.846 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 4 3 4 1 3 3 3 1 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 3 2 2 2 3 2 2 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 4 2 2 3 3 2 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 1 4 2 2 4 1 3 3 2 3 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 1 1 1 1 1 1 1 1 2 2 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 1 2 ...
##  $ readmitted              : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 2 1 1 1 ...

Subsetting and Partitioning the Data

Since the Machine Learning results and statistics presented in this article are generated dynamically, to reduce the wait time and computing resource requirements I used a subset of the imported data set for demonstrating the development of an ensemble. I further partitioned the data into three parts for training, cross-validation, and testing. The actual data employed for developing and conducting the ensemble learning had the following configuration:

## Sampling 14048 obs. with indicated percentage into three partitions: 
## 
## Training data   ( 60 %) = 8302 obs. 
## Validation data ( 20 %) = 2688 obs. 
## Testing data    ( 20 %) = 3058 obs.
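The split can be sketched in base R as follows. This is a minimal illustration, assuming the imported data frame is named `dataimp` and using a simple shuffled-index split; the exact sampling routine I used may differ.

```r
# Sketch of a 60/20/20 split of the imported data frame `dataimp`
# (an assumed name); indices are shuffled once, then sliced.
set.seed(55)
n    <- nrow(dataimp)
idx  <- sample(seq_len(n))                 # shuffle row indices
cut1 <- floor(0.6 * n)                     # end of training slice
cut2 <- floor(0.8 * n)                     # end of validation slice
train <- dataimp[idx[1:cut1], ]            # 60% for training
valid <- dataimp[idx[(cut1 + 1):cut2], ]   # 20% for cross-validation
test  <- dataimp[idx[(cut2 + 1):n], ]      # 20% for testing
```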

Class Imbalance

While examining the training data, it was apparent that the label, readmitted, had a highly disproportionate distribution of values. This was problematic, as class imbalance tends to overwhelm a model and lead to incorrect classification: during training, the model learns much more about the over-sampled class and becomes prone to predicting it, while knowing little about how to classify the under-sampled class. Consequently, a model trained with imbalanced class data could produce a high misclassification rate.
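A quick way to surface the imbalance is to tabulate the label in the training partition. The snippet below is illustrative (the counts shown in a real run would be the training partition's, not specific numbers from this article):

```r
# Tabulate the label to reveal class imbalance in the training data
# (`train` is the training partition; counts are illustrative).
tab <- table(train$readmitted)
print(tab)                          # raw counts per class
print(round(prop.table(tab), 2))    # class proportions
```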

SMOTE

To circumvent the imbalance issue, I used SMOTE from the package Data Mining with R (DMwR) to generate a more balanced set of label values for training, as shown below.

Notice that the balance between oversampling and undersampling is configurable with perc.over and perc.under, as detailed in the documentation.
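A minimal sketch of the SMOTE call is shown below, assuming the DMwR package is installed. The perc.over and perc.under values here are illustrative, not necessarily the settings I actually used:

```r
# Rebalance the training labels with DMwR's SMOTE
# (perc.over/perc.under values are illustrative).
library(DMwR)
set.seed(55)
train <- SMOTE(readmitted ~ ., data = train,
               perc.over  = 100,   # synthesize 100% more minority cases
               perc.under = 200)   # sample majority cases relative to new minority
table(train$readmitted)            # distribution should now be near balanced
```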

Ensemble Learning Using H2O

For a complex problem like hospital readmissions, realizing and optimizing the bias-variance tradeoff is a challenge. Using ensemble learning to complement some algorithms’ weaknesses with others’ strengths, by evaluating, weighting, combining, and optimizing their results, seemed the right strategy and a logical approach. The following illustrates the concept of ensemble learning; additional information is available at the source.

As opposed to preparing data with R/RStudio, for constructing an ensemble model I used a locally run H2O cluster, which provides a user-friendly framework and essentially relieves a Machine Learning developer of the mechanics of setting up a cluster and orchestrating cross-validation of each algorithm. One particularly important benefit of H2O for me is its speed and relatively low resource requirements.

Cluster Initialization

First, I initialized and started the H2O cluster. In this project, I ran the cluster locally on my i7/16 GB RAM laptop. H2O performed satisfactorily throughout the project with no issues.
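The initialization call looked roughly like the sketch below; the memory cap is illustrative, while the port matches the connection details in the log that follows:

```r
# Start (or connect to) a local H2O cluster
# (max_mem_size is illustrative; port matches the log below).
library(h2o)
h2o.init(nthreads     = -1,      # use all available cores
         max_mem_size = '4g',    # cap the JVM heap
         port         = 13579)
```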

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\da\AppData\Local\Temp\Rtmpm6l4sC/h2o_da_started_from_r.out
##     C:\Users\da\AppData\Local\Temp\Rtmpm6l4sC/h2o_da_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 seconds 561 milliseconds 
##     H2O cluster timezone:       America/Chicago 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    1 month and 10 days  
##     H2O cluster name:           H2O_started_from_R_da_bfs143 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.52 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        13579 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.2 (2018-12-20)

Data Frames Conversion

To fit a model, data frames must be loaded into an H2O cluster. Since the data were prepared and stored in memory as R resources, I first converted them to H2O objects. I also set up the label, readmitted, as the response variable and the rest of the variables as predictors, as shown.

{ # CONVERTING DATA PARTITIONS TO H2O OBJECTS
  training_frame   <- as.h2o(train)
  validation_frame <- as.h2o(valid)
  testing_frame    <- as.h2o(test)

  # SETTING UP THE LABEL
  y <- 'readmitted'
  x <- setdiff(names(training_frame), y)
}

Algorithms/Learners

Hospital readmissions is a classification problem, since a patient is either readmitted or not. To develop ensemble learning, the task at this point was to investigate and select a set of algorithms, or learners, complementary to one another, to form an ensemble model. A few algorithms are well known for solving classification problems, including Random Forest and Gradient Boosting, both nicely included in H2O. In this project, four algorithms available in H2O were configured as learners with mostly default settings to form the ensemble.

## nfolds = 10 
## seed   = 55
{ #----------
  # LEARNERS
  #----------
  rf <- h2o.randomForest( x, y, model_id='rf' ,nfolds=nfolds ,seed=seed
    ,training_frame = training_frame, validation_frame = validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)
  gbm <- h2o.gbm(x, y,model_id='gbm',nfolds=nfolds ,seed=seed
    ,training_frame = training_frame, validation_frame = validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)
  glm <- h2o.glm(x, y, model_id='glm',nfolds=nfolds ,seed=seed ,family='binomial'
    ,training_frame = training_frame, validation_frame = validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)
  dl <- h2o.deeplearning(x, y, model_id='dl',nfolds=nfolds ,seed=seed
    ,training_frame = training_frame, validation_frame = validation_frame
    ,fold_assignment='Modulo', keep_cross_validation_predictions=TRUE)

  models <- list(rf@model_id, gbm@model_id, glm@model_id, dl@model_id)
  learners <- c(rf, gbm, glm, dl);saveRDS(learners,paste0(save.dir,'learners.rds'))

}

Stacked Ensemble

The four learners were then stacked to form an ensemble and carry out the learning. For the stacking to work, all learners must use the same cross-validation settings, hence the identical fold_assignment and keep_cross_validation_predictions values above.

  #-----------------
  # STACKED ENSEMBLE
  #-----------------
  stacked <- h2o.stackedEnsemble(x, y, seed=seed
      ,model_id='stacked',base_models=models
      ,training_frame = training_frame, validation_frame = validation_frame
  );saveRDS(stacked,paste0(save.dir,'stacked.rds'))

Model Performance

Since this is a classification problem with high class imbalance, “accuracy” is not the best measure due to the “Accuracy Paradox.” There are various measures available in H2O for assessing model performance. For a classification problem like hospital readmissions, I used logloss and AUC to evaluate performance.
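A table like the one below can be assembled with H2O's metric accessors. This is a sketch assuming the model objects from the previous sections (rf, gbm, glm, dl, stacked) are still in scope:

```r
# Collect training and cross-validation logloss/AUC for each model
# (assumes the fitted H2O models rf, gbm, glm, dl, stacked exist).
perf <- function(m) c(train.logloss = h2o.logloss(m, train = TRUE),
                      cv.logloss    = h2o.logloss(m, xval  = TRUE),
                      train.auc     = h2o.auc(m, train = TRUE),
                      cv.auc        = h2o.auc(m, xval  = TRUE))
round(t(sapply(list(rf = rf, gbm = gbm, glm = glm,
                    dl = dl, stacked = stacked), perf)), 6)
```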

##              train.logloss cv.logloss train.auc cv.auc  
## [1,] rf      0.243299      0.314034   0.980915  0.867404
## [2,] gbm     0.224819      0.304481   0.979687  0.738871
## [3,] glm     0.56802       0.56948    0.779772  0.677925
## [4,] dl      0.213973      1.191293   0.992223  0.770726
## [5,] stacked 0.035222      0.244881   0.999975  0.856268

For logloss, the smaller the value, the better; the opposite holds for AUC. As expected, the stacked ensemble performed a little better than the individual learners. An ensemble generally should improve performance somewhat, yet the improvement should not be dramatic, for example, from poor to exceptional. Regardless, a drastic performance change in an algorithm, in my opinion, always warrants further examination.

On the cross-validation results, performance poorer than training's is expected due to overfitting; the question is really how much overfitting. Deep Learning in this case seemed to lose much performance on logloss, from 0.21 in training to 1.19 in cross-validation, and this degradation was apparently translated into the aggregated logloss performance of the stacked ensemble. On AUC, the stacked model scored almost perfectly (0.9999) during training, which raised my suspicion. Fortunately, the model scored 0.86 with cross-validation, which was more realistic and what I wanted to see.

ROC Curves

The ROC curves below show how Random Forest was a strong contributor to the ensemble, performing closely to what the stacked model ultimately delivered. It may appear that the other three models were not actively contributing, and perhaps should even be excluded from a final ensemble. This is, however, not necessarily true. After all, the context of a learning environment, including the randomization, the composition, and the state of the data, can all influence an outcome.
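Curves like these can be generated from validation performance. The sketch below plots each model's ROC curve in turn, assuming the fitted model objects are still in scope:

```r
# Plot a ROC curve for each model from its validation metrics
# (assumes the fitted H2O models rf, gbm, glm, dl, stacked exist).
for (m in list(rf, gbm, glm, dl, stacked)) {
  p <- h2o.performance(m, valid = TRUE)   # validation-frame metrics
  plot(p, type = 'roc')                   # one ROC curve per model
}
```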

Example of Visualizing Confusion Matrix

Here, for demonstration purposes only, I show what the data distribution and associated confusion matrix looked like based on the test data, with a cutoff point at 0.5, arbitrarily chosen as opposed to being derived from an associated ROC curve.
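The confusion matrix at that cutoff can be obtained as sketched below, assuming the stacked model and the H2O test frame from the earlier sections:

```r
# Confusion matrix on the test data at the arbitrary 0.5 cutoff
# (assumes the fitted `stacked` model and `testing_frame` exist).
perf.test <- h2o.performance(stacked, newdata = testing_frame)
h2o.confusionMatrix(perf.test, thresholds = 0.5)
```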

Closing Thoughts